pacman::p_load(sf, sfdep, spatstat,
tmap, leaflet,
raster,
tidyverse,
DT, knitr, kableExtra, gtsummary,
plotly)1 Thailand’s Killer Roads
Road traffic injuries pose a significant public health issue in Thailand, with a high number of fatalities, injuries, and disabilities each year. These incidents have a profound impact not only on the victims and their families but also on society and the nation as a whole. According to the World Health Organization (WHO), approximately 20,000 people lose their lives in road accidents annually, equating to about 56 deaths each day.
Despite numerous government initiatives to reduce road casualties, the situation has seen minimal improvement. These issues underscore the need for a deeper understanding of road traffic accidents, which can be largely attributed to two primary factors: behavioral and environmental.
Behavioral factors, such as driver behavior and performance, are significant contributors to traffic accidents. These factors include driving style and skills, as well as direct and indirect driving behaviors. Environmental factors encompass conditions such as poor visibility during adverse weather (e.g., heavy rain or fog) and challenging road features (e.g., sharp bends, slippery slopes, and blind spots).
Previous studies have highlighted the value of Spatial Point Patterns Analysis in examining road traffic accidents. However, these studies often focus solely on either behavioral or environmental factors, neglecting the influence of temporal factors such as season, day of the week, or time of day.
In light of this, it is essential to investigate the factors affecting road traffic accidents in the Bangkok Metropolitan Region (BMR) using both spatial and spatio-temporal point patterns analysis methods. This approach aims to identify the leading causes of accidents and provide insights that can guide the development of effective policies and interventions to enhance road safety.
2 Getting Started
2.1 Loading Packages
In this exercise, we will be using the following packages:
| Package | Description |
|---|---|
| sf | For importing, managing, and handling geospatial data |
| spatstat | For point pattern analysis. In this hands-on exercise, it will be used to perform 1st- and 2nd-order spatial point patterns analysis and derive kernel density estimation (KDE) layer. |
| sfdep | Used to compute spatial weights |
| tmap | For thematic mapping |
| leaflet | For interactive maps |
| tidyverse | For non-spatial data wrangling |
| DT, knitr, kableExtra, gtsummary | For building tables |
| plotly | To create interactive plots |
The following code chunk uses p_load() of pacman package to check if the aforementioned packages are installed in the computer. If they are, the libraries will be called into R.
2.2 The Data
| File | Source | Screenshot |
|---|---|---|
Thailand - SubnationalAdministrativeBoundaries, in SHP file format There are 3 administrative levels in the shapefile: 0 (country), 1 (province), 2 (district), and 3 (sub-district, tambon) boundaries. We will use level 1 to filter for the Bangkok Metropolitan Region. |
Humanitarian Data Exchange | |
| Thailand Roads (OpenStreetMap Export), in SHP file format | Humanitarian Data Exchange | |
Thailand Road Accident [2019-2022], in CSV format This dataset provides a comprehensive statistics on recorded road accidents in Thailand from 2019 to 2022, including time location, accident type, weather condition, and road characteristics. |
Kaggle |
2.2.1 Traffic Accident Data
read_csv() of the readr package allows us to import csv files into R Studio.
accidents <- read_csv("data/geospatial/thai_road_accident_2019_2022.csv")glimpse(accidents)Rows: 81,735
Columns: 18
$ acc_code <dbl> 571905, 3790870, 599075, 571924, 599523, 5…
$ incident_datetime <dttm> 2019-01-01 00:00:00, 2019-01-01 00:03:00,…
$ report_datetime <dttm> 2019-01-02 06:11:00, 2020-02-20 13:48:00,…
$ province_th <chr> "ลพบุรี", "อุบลราชธานี", "ประจวบคีรีขันธ์", "เชียงใ…
$ province_en <chr> "Loburi", "Ubon Ratchathani", "Prachuap Kh…
$ agency <chr> "department of rural roads", "department o…
$ route <chr> "แยกทางหลวงหมายเลข 21 (กม.ที่ 31+000) - บ้านวั…
$ vehicle_type <chr> "motorcycle", "private/passenger car", "mo…
$ presumed_cause <chr> "driving under the influence of alcohol", …
$ accident_type <chr> "other", "rollover/fallen on straight road…
$ number_of_vehicles_involved <dbl> 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, …
$ number_of_fatalities <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 3, 0, 0, 1, 0, 0, …
$ number_of_injuries <dbl> 2, 2, 0, 1, 0, 2, 2, 0, 0, 1, 1, 0, 1, 1, …
$ weather_condition <chr> "clear", "clear", "clear", "clear", "clear…
$ latitude <dbl> 14.959105, 15.210738, 12.374259, 18.601721…
$ longitude <dbl> 100.87346, 104.86269, 99.90795, 98.80420, …
$ road_description <chr> "straight road", "straight road", "wide cu…
$ slope_description <chr> "no slope", "no slope", "slope area", "no …
A quick look into the data using glimpse() of the dplyr package reveals that there are 18 variables in the data, including:
| Column Name | Description |
|---|---|
| acc_code | The accident code or identifier. |
| incident_datetime | The date and time of the accident occurrence. |
| report_datetime | The date and time when the accident was reported. |
| province_th | The name of the province in Thailand, written in Thai. |
| province_en | The name of the province in Thailand, written in English. |
| agency | The government agency responsible for the road and traffic management. |
| route | The route or road segment where the accident occurred. |
| vehicle_type | The type of vehicle involved in the accident. |
| presumed_cause | The presumed cause or reason for the accident. |
| accident_type | The type of nature of the accident. |
| number_of_vehicles_involved | The number of vehicles involved in the accident. |
| number_of_fatalities | The number of fatalities resulting from the accident. |
| number_of_injuries | The number of injuries resulting from the accident. |
| weather_condition | The weather condition at the time of the accident. |
| latitude | The latitude coordinate of the accident location. |
| longitude | The longitude coordinate of the accident location. |
| road_description | The description of the road type or configuration where the accident occurred. |
| slope_description | The description of the slope condition at the accident location. |
Since the province_th and route columns are in Thai, I will remove them. The province column is already available in English, and the route information can be projected using the latitude and longitude columns. Additionally, I will remove the report_datetime column and use incident_datetime for our analysis.
accidents <- accidents %>%
select(-c(province_th,
route,
report_datetime))The gtsummary package provides us with descriptive statistics of the dataset and also includes the amount of missingness in each variable.
| Characteristic | N | N = 81,7351 |
|---|---|---|
| acc_code | 81,735 | 3,824,084 (3,789,460, 5,831,089) |
| incident_datetime | 81,735 | 2019-01-01 to 2022-12-31 23:55:00 |
| province_en | 81,735 | |
| Amnat Charoen | 196 (0.2%) | |
| Ang Thong | 1,125 (1.4%) | |
| Bangkok | 6,439 (7.9%) | |
| buogkan | 583 (0.7%) | |
| Buri Ram | 327 (0.4%) | |
| Chachoengsao | 1,407 (1.7%) | |
| Chai Nat | 692 (0.8%) | |
| Chaiyaphum | 471 (0.6%) | |
| Chanthaburi | 1,360 (1.7%) | |
| Chiang Mai | 3,407 (4.2%) | |
| Chiang Rai | 858 (1.0%) | |
| Chon Buri | 4,120 (5.0%) | |
| Chumphon | 1,039 (1.3%) | |
| Kalasin | 659 (0.8%) | |
| Kamphaeng Phet | 822 (1.0%) | |
| Kanchanaburi | 1,165 (1.4%) | |
| Khon Kaen | 1,224 (1.5%) | |
| Krabi | 741 (0.9%) | |
| Lampang | 1,018 (1.2%) | |
| Lamphun | 744 (0.9%) | |
| Loburi | 1,036 (1.3%) | |
| Loei | 742 (0.9%) | |
| Mae Hong Son | 308 (0.4%) | |
| Maha Sarakham | 832 (1.0%) | |
| Mukdahan | 668 (0.8%) | |
| Nakhon Nayok | 386 (0.5%) | |
| Nakhon Pathom | 891 (1.1%) | |
| Nakhon Phanom | 462 (0.6%) | |
| Nakhon Ratchasima | 3,330 (4.1%) | |
| Nakhon Sawan | 1,403 (1.7%) | |
| Nakhon Si Thammarat | 1,363 (1.7%) | |
| Nan | 931 (1.1%) | |
| Narathiwat | 477 (0.6%) | |
| Nong Bua Lam Phu | 288 (0.4%) | |
| Nong Khai | 297 (0.4%) | |
| Nonthaburi | 827 (1.0%) | |
| Pathum Thani | 1,923 (2.4%) | |
| Pattani | 395 (0.5%) | |
| Phangnga | 328 (0.4%) | |
| Phatthalung | 1,068 (1.3%) | |
| Phayao | 543 (0.7%) | |
| Phetchabun | 1,704 (2.1%) | |
| Phetchaburi | 873 (1.1%) | |
| Phichit | 226 (0.3%) | |
| Phitsanulok | 800 (1.0%) | |
| Phra Nakhon Si Ayutthaya | 1,571 (1.9%) | |
| Phrae | 1,466 (1.8%) | |
| Phuket | 444 (0.5%) | |
| Prachin Buri | 809 (1.0%) | |
| Prachuap Khiri Khan | 1,141 (1.4%) | |
| Ranong | 140 (0.2%) | |
| Ratchaburi | 497 (0.6%) | |
| Rayong | 770 (0.9%) | |
| Roi Et | 721 (0.9%) | |
| Sa Kaeo | 699 (0.9%) | |
| Sakon Nakhon | 973 (1.2%) | |
| Samut Prakan | 2,241 (2.7%) | |
| Samut Sakhon | 1,015 (1.2%) | |
| Samut Songkhram | 524 (0.6%) | |
| Saraburi | 1,158 (1.4%) | |
| Satun | 321 (0.4%) | |
| Si Sa Ket | 739 (0.9%) | |
| Sing Buri | 448 (0.5%) | |
| Songkhla | 1,231 (1.5%) | |
| Sukhothai | 1,305 (1.6%) | |
| Suphan Buri | 3,056 (3.7%) | |
| Surat Thani | 1,983 (2.4%) | |
| Surin | 843 (1.0%) | |
| Tak | 1,859 (2.3%) | |
| Trang | 784 (1.0%) | |
| Trat | 469 (0.6%) | |
| Ubon Ratchathani | 751 (0.9%) | |
| Udon Thani | 946 (1.2%) | |
| unknown | 34 (<0.1%) | |
| Uthai Thani | 503 (0.6%) | |
| Uttaradit | 949 (1.2%) | |
| Yala | 343 (0.4%) | |
| Yasothon | 504 (0.6%) | |
| agency | 81,735 | |
| department of highways | 75,304 (92%) | |
| department of rural roads | 5,115 (6.3%) | |
| expressway authority of thailand | 1,316 (1.6%) | |
| vehicle_type | 81,735 | |
| 4-wheel pickup truck | 28,445 (35%) | |
| 6-wheel truck | 3,019 (3.7%) | |
| 7-10-wheel truck | 2,364 (2.9%) | |
| bicycle | 257 (0.3%) | |
| large passenger vehicle | 421 (0.5%) | |
| large truck with trailer | 6,257 (7.7%) | |
| motorcycle | 14,874 (18%) | |
| motorized tricycle | 294 (0.4%) | |
| other | 3,513 (4.3%) | |
| passenger pickup truck | 325 (0.4%) | |
| pedestrian | 219 (0.3%) | |
| private/passenger car | 20,814 (25%) | |
| three-wheeled vehicle | 28 (<0.1%) | |
| tractor/agricultural vehicle | 63 (<0.1%) | |
| van | 842 (1.0%) | |
| presumed_cause | 81,735 | |
| abrupt lane change | 134 (0.2%) | |
| aggressive driving/overtaking | 12 (<0.1%) | |
| brake/anti-lock brake system failure | 55 (<0.1%) | |
| cutting in closely by people/vehicles/animals | 6,724 (8.2%) | |
| dangerous curve | 61 (<0.1%) | |
| debris/obstruction on the road | 294 (0.4%) | |
| disabled vehicle without proper signals | 37 (<0.1%) | |
| disabled vehicle without proper signals/signs | 5 (<0.1%) | |
| driving in the wrong lane | 37 (<0.1%) | |
| driving under the influence of alcohol | 1,464 (1.8%) | |
| driving without headlights/illumination | 20 (<0.1%) | |
| external disturbance | 2 (<0.1%) | |
| failure to signal enter/exit parking | 43 (<0.1%) | |
| failure to yield right of way | 133 (0.2%) | |
| failure to yield/signal | 215 (0.3%) | |
| falling asleep | 4,700 (5.8%) | |
| ignoring stop sign while leaving intersection | 79 (<0.1%) | |
| illegal overtaking | 445 (0.5%) | |
| inadequate visibility | 13 (<0.1%) | |
| insufficient light | 69 (<0.1%) | |
| internal disturbance | 1 (<0.1%) | |
| loss of control | 49 (<0.1%) | |
| medical condition | 40 (<0.1%) | |
| narrow road | 10 (<0.1%) | |
| navigation equipment failure | 1 (<0.1%) | |
| no presumed cause related to driver | 1 (<0.1%) | |
| no presumed cause related to road conditions | 4 (<0.1%) | |
| no presumed cause related to vehicle conditions | 1 (<0.1%) | |
| no road divider lines | 1 (<0.1%) | |
| no traffic light system | 1 (<0.1%) | |
| no traffic signs | 9 (<0.1%) | |
| obstruction in sight | 7 (<0.1%) | |
| other | 1,576 (1.9%) | |
| overloaded vehicle | 176 (0.2%) | |
| repair/construction on the road | 6 (<0.1%) | |
| reversing vehicle | 70 (<0.1%) | |
| road in poor condition | 7 (<0.1%) | |
| running red lights/traffic signals | 911 (1.1%) | |
| slippery road | 85 (0.1%) | |
| speeding | 60,373 (74%) | |
| straddling lanes | 31 (<0.1%) | |
| sudden stop | 48 (<0.1%) | |
| tailgating | 181 (0.2%) | |
| traffic light system failure | 3 (<0.1%) | |
| turn signal system failure | 2 (<0.1%) | |
| unfamiliarity with the route/unskilled driving | 677 (0.8%) | |
| using mobile phone while driving | 33 (<0.1%) | |
| using psychoactive substances | 1 (<0.1%) | |
| vehicle electrical system failure | 11 (<0.1%) | |
| vehicle equipment failure | 2,747 (3.4%) | |
| worn-out/tire blowout | 127 (0.2%) | |
| เส้นแบ่งทิศทางจราจรชำรุด | 1 (<0.1%) | |
| ป้ายจราจรชำรุด | 1 (<0.1%) | |
| มึนเมาจากแอลกอฮอล์ | 1 (<0.1%) | |
| accident_type | 81,735 | |
| collision at intersection corner | 1,067 (1.3%) | |
| collision during overtaking | 83 (0.1%) | |
| collision with obstruction (on road surface) | 2,742 (3.4%) | |
| head-on collision (not overtaking) | 3,944 (4.8%) | |
| other | 4,188 (5.1%) | |
| pedestrian collision | 945 (1.2%) | |
| rear-end collision | 24,901 (30%) | |
| rollover/fallen on curved road | 10,443 (13%) | |
| rollover/fallen on straight road | 33,046 (40%) | |
| side collision | 336 (0.4%) | |
| turning/retreating collision | 40 (<0.1%) | |
| number_of_vehicles_involved | 81,735 | 1.00 (1.00, 2.00) |
| number_of_fatalities | 81,735 | 0.00 (0.00, 0.00) |
| number_of_injuries | 81,735 | 0.00 (0.00, 1.00) |
| weather_condition | 81,735 | |
| clear | 69,943 (86%) | |
| dark | 368 (0.5%) | |
| foggy | 544 (0.7%) | |
| land slide | 1 (<0.1%) | |
| natural disaster | 47 (<0.1%) | |
| other | 162 (0.2%) | |
| rainy | 10,670 (13%) | |
| latitude | 81,376 | 14.5 (13.5, 16.6) |
| NA | 359 | |
| longitude | 81,376 | 100.56 (99.89, 101.29) |
| NA | 359 | |
| road_description | 81,735 | |
| bridge (across river/canal) | 5 (<0.1%) | |
| connecting to private area | 463 (0.6%) | |
| connecting to public/commercial area | 1,001 (1.2%) | |
| connecting to school area | 121 (0.1%) | |
| four-way intersection | 348 (0.4%) | |
| grade-separated intersection/ramps | 184 (0.2%) | |
| lane-changing area | 7 (<0.1%) | |
| merge lane | 19 (<0.1%) | |
| motorcycle lane | 1 (<0.1%) | |
| other | 7,614 (9.3%) | |
| pedestrian path | 1 (<0.1%) | |
| roundabout | 11 (<0.1%) | |
| sharp curve | 616 (0.8%) | |
| straight road | 58,183 (71%) | |
| t-intersection | 414 (0.5%) | |
| u-turn area | 10 (<0.1%) | |
| wide curve | 12,552 (15%) | |
| y-intersection | 184 (0.2%) | |
| zebra crossing/pedestrian crossing | 1 (<0.1%) | |
| slope_description | 81,735 | |
| no slope | 65,797 (81%) | |
| other | 11,575 (14%) | |
| slope area | 4,363 (5.3%) | |
| 1 Median (IQR); Range; n (%) | ||
Observations
In terms of data health, we can see that there are 359 missing records in the latitude and longitude columns. Since these missing records make up <5% of the 81,735 total road accidents recorded, we can remove these from our data set.
We can use latitude and longitude columns to convert the data into a sf object and transform the coordinates into CRS. The EPSG code for latitude-longitude projection is 4326.
Thailand adopts EPSG 32647, which is in metres.
Since the Bangkok Metropolitan Region is made up of 6 provinces – Bangkok, Nonthaburi, Nakhon Pathom, Pathum Thani, Samut Prakan, and Samut Sakhon, we can use the province_en column to filter for these provinces.
The code chunk performs several functions for preparing and transforming the dataset:
- The
filter()function from the dplyr package is used to remove rows with missing or empty longitude and latitude values, ensuring that the dataset contains only valid spatial data. - The sf package’s
st_as_sf()function converts the data frame into a spatial object (simple feature) using the longitude and latitude columns, setting the coordinate reference system (CRS) to EPSG 4326. - The spatial data is reprojected to EPSG 32647 (a local UTM projection used in Thailand) using
st_transform()from the sf package. - The
filter()function is also applied to retain data exclusively from the Bangkok Metropolitan Region (BMR) by filtering for specific provinces.
accidents_bmr <- accidents %>%
# Removes rows with missing longitude and latitude values
filter(!is.na(longitude) & longitude != "",
!is.na(latitude) & latitude !="") %>%
# Converts data to an sf object using longitude and latitude
st_as_sf(coords = c("longitude", "latitude"),
crs = 4326) %>%
# Transforms to the projection used in Thailand
st_transform(crs = 32647) %>%
# Filter for BMR data
filter(province_en %in% c("Bangkok", "Nonthaburi", "Nakhon Pathom", "Pathum Thani", "Samut Prakan", "Samut Sakhon")) Let’s perform a check for duplicated records before we move on. The code chunk below identifies all rows in the accidents_bmr dataframe that have an exact duplicate (i.e., another row with the same values in all columns) using group_by_all().
duplicate <- accidents_bmr %>%
group_by_all() %>%
filter(n()>1) %>%
ungroup()
duplicateSimple feature collection with 0 features and 13 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: WGS 84 / UTM zone 47N
# A tibble: 0 × 14
# ℹ 14 variables: acc_code <dbl>, incident_datetime <dttm>, province_en <chr>,
# agency <chr>, vehicle_type <chr>, presumed_cause <chr>,
# accident_type <chr>, number_of_vehicles_involved <dbl>,
# number_of_fatalities <dbl>, number_of_injuries <dbl>,
# weather_condition <chr>, road_description <chr>, slope_description <chr>,
# geometry <GEOMETRY [m]>
Results confirm that there are no duplicated records found.
write_rds(accidents_bmr, "data/rds/accidents_bmr.rds")accidents_bmr <- read_rds("data/rds/accidents_bmr.rds")Let’s take a quick glance at the points:
tmap_mode('plot')
tm_shape(accidents_bmr) +
tm_dots(col = "#800200",
alpha=0.4,
size=0.05) +
tm_layout(main.title = "Accidents",
main.title.position = "center",
main.title.size = 1,
bg.color = "#E4D5C9",
frame = F)
2.2.2 Administrative Boundary
The adminboundary dataset, which we downloaded from HDX, is in ESRI shapefile format. To use this data in an R-environment, we need to import it as an sf object. We can do this using the st_read() function of the sf package. This function reads the shapefile data and returns an sf object that can be used for further analysis.
In the code chunk below, we use %>% operator is used to pipe the output of st_read() to the st_transform() function. Since the dataset we are using is the Thailand boundary, we need to assign the standard coordinate reference system for Thailand, which is EPSG 32647. st_transform() function transforms the coordinate reference system of the sf object to 32647.
adminboundary <- st_read(dsn = "data/geospatial",
layer = "tha_admbnda_adm1_rtsd_20220121") %>%
st_transform(crs = 32647)Reading layer `tha_admbnda_adm1_rtsd_20220121' from data source
`C:\kytjy\ISSS626-GAA\Take-home_Ex\Take-home_Ex01\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 77 features and 16 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 97.34336 ymin: 5.613038 xmax: 105.637 ymax: 20.46507
Geodetic CRS: WGS 84
st_crs(adminboundary)Coordinate Reference System:
User input: EPSG:32647
wkt:
PROJCRS["WGS 84 / UTM zone 47N",
BASEGEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]],
CONVERSION["UTM zone 47N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",99,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Navigation and medium accuracy spatial referencing."],
AREA["Between 96°E and 102°E, northern hemisphere between equator and 84°N, onshore and offshore. China. Indonesia. Laos. Malaysia - West Malaysia. Mongolia. Myanmar (Burma). Russian Federation. Thailand."],
BBOX[0,96,84,102]],
ID["EPSG",32647]]
Preliminary look into the adminboundary data shows that ADM1_EN can be used to filter for the 6 regions in BMR.
glimpse(adminboundary)Rows: 77
Columns: 17
$ Shape_Leng <dbl> 2.417227, 1.695100, 1.251111, 1.884945, 3.041716, 1.739908,…
$ Shape_Area <dbl> 0.13133873, 0.07926199, 0.05323766, 0.12698345, 0.21393797,…
$ ADM1_EN <chr> "Bangkok", "Samut Prakan", "Nonthaburi", "Pathum Thani", "P…
$ ADM1_TH <chr> "กรุงเทพมหานคร", "สมุทรปราการ", "นนทบุรี", "ปทุมธานี", "พระนครศรีอ…
$ ADM1_PCODE <chr> "TH10", "TH11", "TH12", "TH13", "TH14", "TH15", "TH16", "TH…
$ ADM1_REF <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ADM1ALT1EN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ADM1ALT2EN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ADM1ALT1TH <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ADM1ALT2TH <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ADM0_EN <chr> "Thailand", "Thailand", "Thailand", "Thailand", "Thailand",…
$ ADM0_TH <chr> "ประเทศไทย", "ประเทศไทย", "ประเทศไทย", "ประเทศไทย", "ประเทศ…
$ ADM0_PCODE <chr> "TH", "TH", "TH", "TH", "TH", "TH", "TH", "TH", "TH", "TH",…
$ date <date> 2019-02-18, 2019-02-18, 2019-02-18, 2019-02-18, 2019-02-18…
$ validOn <date> 2022-01-22, 2022-01-22, 2022-01-22, 2022-01-22, 2022-01-22…
$ validTo <date> -001-11-30, -001-11-30, -001-11-30, -001-11-30, -001-11-30…
$ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((674339.8 15..., MULTIPOLYGON (…
adminboundary_bmr <- adminboundary %>%
select("ADM1_EN") %>%
filter(ADM1_EN %in% c("Bangkok", "Nonthaburi", "Nakhon Pathom", "Pathum Thani", "Samut Prakan", "Samut Sakhon"))write_rds(adminboundary_bmr, "data/rds/adminboundary_bmr.rds")adminboundary_bmr <- read_rds("data/rds/adminboundary_bmr.rds")After importing the dataset, we will plot it to see how it looks using tmap.
tmap_mode('plot')
tm_shape(adminboundary_bmr)+
tm_fill(col ="#f4e9e8",
alpha = 0.6) +
tm_borders(col = "#ddafa1",
lwd = 0.1,
alpha = 1) +
tm_layout(main.title = "BMR Administrative Boundary",
main.title.position = "center",
main.title.size = 1,
bg.color = "#E4D5C9",
frame = F)
Let’s plot out the administrative boundaries together with the points of accidents:
tmap_mode('plot')
tm_shape(adminboundary_bmr) +
tm_fill(col ="#f4e9e8",
alpha = 0.6) +
tm_borders(col = "#ddafa1",
lwd = 0.1,
alpha = 1) +
tm_shape(accidents_bmr) +
tm_dots(col = "#800200",
alpha=0.4,
size=0.05) +
tm_layout(main.title = "BMR Administrative Boundary and Accidents",
main.title.position = "center",
main.title.size = 1,
bg.color = "#E4D5C9",
frame = F)
2.2.3 Road Lines
The code chunk below imports MP_SUBZONE_WEB_PL shapefile by using st_read() of sf packages.
roads <- st_read(dsn = "data/geospatial",
layer = "hotosm_tha_roads_lines_shp")Reading layer `hotosm_tha_roads_lines_shp' from data source
`C:\kytjy\ISSS626-GAA\Take-home_Ex\Take-home_Ex01\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 2792590 features and 14 fields
Geometry type: MULTILINESTRING
Dimension: XY
Bounding box: xmin: 97.34457 ymin: 5.643645 xmax: 105.6528 ymax: 20.47168
CRS: NA
Observations
The geometry of our data is in multi-linestrings. To convert to individual linestrings, we use
st_cast().Note that roads does not have crs information. The
st_set_crs()function allows us to assign coordinate reference system for the data based on latitude and longitude seen under the Bounding Box variable which is decimal degrees. After assigning the CRS,st_transform()is used to reproject the data to the local CRS. Finally, we can usest_crs()to verify that the correct CRS has been applied.
roads_sf <- st_set_crs(roads, 4326) %>%
st_transform(crs = 32647) %>%
st_cast("LINESTRING")st_crs(roads_sf)Coordinate Reference System:
User input: EPSG:32647
wkt:
PROJCRS["WGS 84 / UTM zone 47N",
BASEGEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]],
CONVERSION["UTM zone 47N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",99,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Navigation and medium accuracy spatial referencing."],
AREA["Between 96°E and 102°E, northern hemisphere between equator and 84°N, onshore and offshore. China. Indonesia. Laos. Malaysia - West Malaysia. Mongolia. Myanmar (Burma). Russian Federation. Thailand."],
BBOX[0,96,84,102]],
ID["EPSG",32647]]
Results above show that the EPSG is indicated as 32647 now.
glimpse(roads_sf)Rows: 2,792,591
Columns: 15
$ name <chr> "ถนนฉลองกรุง", "ซอยฉลองกรุง 1/1", NA, NA, "ถนนฉลองกรุง", NA, "…
$ name_en <chr> "Chalong Krung Road", "Soi Chalong Krung 1/1", NA, NA, "Cha…
$ highway <chr> "secondary", "residential", "secondary_link", "service", "s…
$ surface <chr> "paved", NA, NA, NA, "concrete", NA, NA, "unpaved", NA, NA,…
$ smoothness <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ width <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lanes <chr> NA, NA, NA, NA, "2", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ oneway <chr> "yes", NA, "yes", NA, "yes", NA, NA, NA, NA, NA, NA, NA, NA…
$ bridge <chr> NA, NA, NA, NA, "yes", NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ layer <chr> NA, NA, NA, NA, "1", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ source <chr> NA, NA, NA, NA, "Bing", NA, NA, "GPS", NA, NA, NA, NA, NA, …
$ name_th <chr> "ถนนฉลองกรุง", "ซอยฉลองกรุง 1/1", NA, NA, "ถนนฉลองกรุง", NA, "…
$ osm_id <dbl> 1125681229, 594401607, 472283206, 594401608, 116847248, 317…
$ osm_type <chr> "ways_line", "ways_line", "ways_line", "ways_line", "ways_l…
$ geometry <LINESTRING [m]> LINESTRING (693686.1 151979..., LINESTRING (6933…
Next, we will look at the classification of road networks in the data.
unique(roads_sf$highway) [1] "secondary" "residential" "secondary_link" "service"
[5] "tertiary" "path" "footway" "track"
[9] "unclassified" "trunk" "trunk_link" "primary"
[13] "primary_link" "steps" "motorway_link" "cycleway"
[17] "pedestrian" "tertiary_link" "motorway" "construction"
[21] "road" "raceway" "corridor" "living_street"
[25] "escape" "proposed" "busway" "bridleway"
[29] "abandoned" "parth" "barrier" "paved"
Upon reviewing the road classifications against the highway descriptions by OpenStreetMap, we observe that not all categories are relevant to our analysis, which focuses primarily on driving networks. We will keep the 13 types of road segments below for the scope of our study.
| Name | Description |
|---|---|
| Primary, Primary_Link | Major highway linking large towns |
| Secondary, Secondary_Link | Highways which are not part of major routes, but nevertheless form a link in the national route network |
| Tertiary, Tertiary_Link | Roads connecting smaller settlements, and within large settlements for roads connecting local centres |
| Trunk, Trunk_Link | Major highway that don’t meet the requirements for motorways, and their link roads (sliproads and ramps). |
| Motorway, Motorway_Link | Highest-performance roads within a territory, generally referred to as motorways, freeways, or expressways. |
| Residential | Road generally used for local traffic within the settlement. Primarily used for access to residential properties but may include access to some non-residential properties (e.g. a corner shop or convenience store). |
| Unclassified | Roads with the lowest priority in the interconnecting road network |
| Service | Provides access to a building service station, beach, campsite, industrial estate, business park, etc. |
In the code chunk below, we will first specify the road classes that we want to retain. Next, we will filter roads_sf object to remove all the rows that does not have our desired highway attribute value.
highwaytypes <- c("primary",
"primary_link",
"secondary",
"secondary_link",
"tertiary",
"tertiary_link",
"trunk",
"trunk_link",
"motorway",
"motorway_link",
"residential",
"unclassified",
"service")
roads_sf_filtered <- roads_sf %>%
select("highway") %>%
filter(highway %in% highwaytypes)
unique(roads_sf_filtered$highway) [1] "secondary" "residential" "secondary_link" "service"
[5] "tertiary" "unclassified" "trunk" "trunk_link"
[9] "primary" "primary_link" "motorway_link" "tertiary_link"
[13] "motorway"
Since the roads dataset includes areas outside BMR and our analysis is focused on the region within the BMR, we will need to remove unnecessary rows. To do so, we will use st_intersection().
roads_bmr <- st_intersection(roads_sf_filtered,
adminboundary_bmr)write_rds(roads_bmr, "data/rds/roads_bmr.rds")roads_bmr <- read_rds("data/rds/roads_bmr.rds")Now that we have scoped out the dataset, we will now plot to see the BMR road network.
tmap_mode("plot")
tm_shape(adminboundary_bmr) +
tm_polygons() 
#tm_shape(roads_bmr) +
# tm_lines(col="highway", palette ="viridis") +
# tm_layout(main.title = "Road Network in Singapore",
# main.title.position = "center",
# main.title.size = 1.2,
# legend.outside = TRUE,
# frame = TRUE) +
# tm_borders(alpha = 0.5)par(bg = '#E4D5C9', mar = c(0,0,1,0))
plot(st_geometry(adminboundary_bmr),
col = "#eeeae2",
main = "Road Network in BMR")
plot(st_geometry(roads_bmr),
add=T,
col='#800200')
3 Data Wrangling
3.1 Deriving new variables from Accidents data
Using the incident_datetime column, we can also derive additional columns such as seasons, month, day of the week, time of the day (e.g. morning or evening peak periods). The morning rush hour is said to last from 6am to 9am and evening rush hour is reported to be from 4pm to 7pm (The Nation).
I also think it would be interesting to derive a column that indicates accidents that happened during the Songkran festival holiday. Notoriously dubbed as the Seven Deadly Days of Songkran, road accidents in Bangkok is said to surge due increased traffic from people traveling to celebrate with family.
accidents_bmr_extra <- accidents_bmr %>%
# Derive month, days, hour columns as well as a Songkran indicator
mutate(
month = month(incident_datetime,
label = TRUE),
day = wday(incident_datetime,
label = TRUE),
hour = hour(incident_datetime),
songkran = ifelse(
as_date(incident_datetime) >= as_date(paste0(year(incident_datetime), "-04-09")) &
as_date(incident_datetime) <= as_date(paste0(year(incident_datetime), "-04-16")) &
year(incident_datetime) %in% c(2019, 2020, 2021, 2022),
1,
0)) %>%
# Derive season column and peak period indicator
mutate(
weektype = ifelse(
day %in% c("Sat", "Sun"),
"weekend",
"weekday"
),
season = ifelse(
month %in% c("Feb", "Mar", "Apr", "May"),
"Summer",
ifelse(
month %in% c("Jun", "Jul", "Aug", "Sep", "Oct"),
"Rainy",
"Winter"
)
),
peakperiod = ifelse(
hour > 6 & hour < 9,
"morningpeak",
ifelse(
hour > 16 & hour < 19,
"eveningpeak",
"non-peak"
)
)
) %>%
# Drop columns not required anymore
select(-c("acc_code",
"incident_datetime",
"agency",
"month", "day", "hour"))The code chunk above performs the following function:
- The lubridate package within tidyverse is utilised to extract temporal components (month, day of the week, and hour) from the incident_datetime column using the
month(),wday(), andhour()functions, respectively. - The songkran indicator is created using
ifelse(), along withas_date()andyear(), to generate a binary variable (1 for “Yes”, 0 for “No”) indicating whether the accident occurred during the Songkran festival (April 9 to April 16) in the years 2019, 2020, 2021, or 2022. - Two additional indicators are derived using
ifelse():- season variable: classifies accidents into seasons (“Summer”, “Rainy”, or “Winter”) based on the month.
- peakperiod variable: identifies whether the accident occurred during morning peak hours (7-9 AM), evening peak hours (5-7 PM), or non-peak times.
- Finally,
select()from dplyr is used to keep only the necessary variables for the study.
3.2 Converting sf format into spatstat’s ppp format
In order to use the capabilities of spatstat package, a spatial dataset should be converted into an object of class planar point pattern (ppp). A ppp object contains the spatial coordinates of the points, the marks attached to the points (if any), the window in which the points were observed, and the name of the unit of length for the spatial coordinates. Thus, a single object of class ppp contains all the information required to perform spatial point pattern analysis.
In previous section, we have created sf objects of accident points. Now, we will convert them into ppp objects using as.ppp() function from spatstat package.
The code chunk below converts the accidents_bmr_lean object to a point pattern object of class ppp. st_coordinates() function is used to extract the coordinates of the accidents_bmr_lean object and st_bbox() function is used to extract the bounding box of the accidents_bmr_lean object. The resulting object accidents_ppp is a point pattern object of class ppp.
accidents_ppp <- as.ppp(st_coordinates(accidents_bmr_extra),
st_bbox(accidents_bmr_extra))
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
plot(accidents_ppp)
3.3 Duplicates check
We will use summary() function to get summary information of accidents_ppp object.
summary(accidents_ppp)Planar point pattern: 12986 points
Average intensity 1.218049e-06 points per square unit
*Pattern contains duplicated points*
Coordinates are given to 10 decimal places
Window: rectangle = [591277.5, 710166.1] x [1486845.7, 1576520.5] units
(118900 x 89670 units)
Window area = 10661300000 square units
Observations
Note that the message above suggests that the pattern contains duplicated points.
When analysing spatial point processes, it is important to avoid duplication of points. This is because statistical methodology for spatial point processes is based largely on the assumption that processes are simple, i.e., that points of the process can never be coincident. When the data have coincident points, some statistical procedures designed for simple point processes will be severely affected.
3.4 Jittering
To resolve the issue of duplicated points, we apply jittering with the rjitter() function. This adds a small variation to the points, preventing them from occupying the exact same location.
set.seed(1234)
accidents_ppp_jit <- rjitter(accidents_ppp,
retry=TRUE,
nsim=99,
drop=TRUE)The code below checks the jittered points in the chosen simulation (i.e. Simulation 99) to ensure that no duplicates remain after applying jittering.
any(duplicated(accidents_ppp_jit[["Simulation 99"]]))[1] FALSE
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
accidents_ppp_jit <- accidents_ppp_jit[["Simulation 99"]]
plot(accidents_ppp_jit)
3.5 Creating Observation Windows
Many data types in spatstat require us to specify the region of space inside which the data were observed. This is the observation window and it is represented by an object of class owin. In this analysis, our study area is BMR, hence we will use BMR boundary as the observation window for spatial point pattern analysis.
To convert our adminboundary_bmr sf object to owin object, we will use as.owin() function from spatstat package.
adminboundary_bmr_owin <- as.owin(adminboundary_bmr)
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
plot.owin(adminboundary_bmr_owin)
summary(adminboundary_bmr_owin)Window: polygonal boundary
single connected closed polygon with 13779 vertices
enclosing rectangle: [587893.5, 712440.5] x [1484413.7, 1579076.3] units
(124500 x 94660 units)
Window area = 7668990000 square units
Fraction of frame area: 0.65
3.6 Combing ppp and owin objects
In section 3.4, we have created our ppp object (accidents_ppp_jit) which represents the spatial points of accident locations. In section 3.5, we have created a owin object called (adminboundary_bmr_owin), which represent the observation window of our analysis.
The observation window adminboundary_bmr_owin and the point pattern accidents_ppp_jit can be combined, so that the custom window replaces the default rectangular extent (as seen in section 3.2).
acc_bmr_ppp = accidents_ppp_jit[adminboundary_bmr_owin]
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
plot(acc_bmr_ppp)
4 Exploratory Spatial Data Analysis
Descriptive statistics are used in point pattern analysis to summarise a point pattern’s basic properties, such as its central tendency and dispersion. The mean centre and the median centre are two often employed metrics for central tendency.
4.1 Measuring Central Tendency
4.1.1 Mean Center
Mean center is the arithmetic average of the (x, y) coordinates of all point in the study area. Similar to mean in statistical analysis, mean center is influenced to a greater degree by the outliers.
accidents_xy <- st_coordinates(accidents_bmr_extra)
accidents_mc <- apply(accidents_xy, 2, mean)
accidents_mc X Y
668399.5 1523495.8
The results show that the mean centre is at (668399.5, 1523495.9).
4.1.2 Median Center
Median center is the location that minimises the sum of distances required to travel to all points within an observation window. The procedure begins at a predetermined point, such as the median center, as the initial point. Then, the algorithm updates the median center’s new coordinates (x’, y’) continually until the optimal value is reached. The median center, as opposed to the mean center, offers a more reliable indicator of central tendency as it is unaffected by outliers.
accidents_medc <- apply(accidents_xy, 2, median)
accidents_medc X Y
673446.1 1520755.0
Based on the results, the median centre of accidents is (673446.1, 1520755.0).
The mean centers and median centers are similar. This may imply that the distribution of the data is relatively balanced and there is not a significant difference in the spatial patterns between the accident points. Additionally, this indicates that both the mean center and median center are effective measures for analyzing the central tendency of the data in this context.
par(bg = '#E4D5C9', mar = c(0,0,1,0))
plot(st_geometry(adminboundary_bmr),
col='#eeeae2')
plot(accidents_xy,
add = T, cex=0.7, pch = 21,
main="Mean and Median Centers of Accidents in BMR")
points(cbind(accidents_mc[1], accidents_mc[2]), pch='*', col='#f5347f', cex=3)
points(cbind(accidents_medc[1], accidents_medc[2]), pch='*', col='#bb8bdc', cex=3)
4.2 Measuring Dispersion
4.2.1 Standard Distance
Standard distances are defined similarly to standard deviations. This indicator measures how dispersed a group of points is around its mean center.
accidents_sd <- sqrt(sum((accidents_xy[,1] - accidents_mc[1])^2 +
(accidents_xy[,2] - accidents_mc[2])^2)
/ nrow(accidents_xy))
accidents_sd[1] 27235.14
4.2.2 Plotting Standard Distance
In this section, we will create bearing circle of accident points using the standard distance value we have calculated earlier. This can provide visual representation of the dispersion.
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
plot(st_geometry(adminboundary_bmr), col='#eeeae2', main="Standard Distance of Accidents in BMR")
plot(accidents_xy, cex=.5)
points(cbind(accidents_mc[1], accidents_mc[2]), pch='*', col='#f5347f', cex=3)
bearing <- 1:360 * pi/180
cx <- accidents_mc[1] + accidents_sd * cos(bearing)
cy <- accidents_mc[2] + accidents_sd * sin(bearing)
circle <- cbind(cx, cy)
lines(circle, col='#f5347f', lwd=2)
4.3 Spatial Randomness Test
Clark and Evans (1954) give a very simple test of spatial randomness called Clark and Evans aggregation index (R). It is the ratio of the observed mean nearest neighbour distance in the pattern to that expected for a Poisson point process of the same intensity. R-value >1 suggests ordering, while R-value <1 suggests clustering.
We will perform the Clark-Evans test of aggregation for a spatial point pattern by using clarkevans.test() of statspat.
The test hypotheses are:
\(H_0\) = The distribution of accident points are randomly distributed.
\(H_1\) = The distribution of accidents points are not randomly distributed.
The 95% confidence interval will be used.
set.seed(1234)
clarkevans.test(acc_bmr_ppp,
correction="none",
clipregion="adminboundary_bmr_owin",
alternative=c("clustered"),
nsim=99)
Clark-Evans test
No edge correction
Z-test
data: acc_bmr_ppp
R = 0.22579, p-value < 2.2e-16
alternative hypothesis: clustered (R < 1)
The Clark-Evans test for the accident points shows an R-value of 0.22579, which is less than 1. This indicates a clustered distribution. The p-value is less than 2.2e-16, which is extremely small and less than the significance level of 0.05. This means that we will reject the null hypothesis (\(H_0\)) and accept the alternative hypothesis (\(H_1\)).
Therefore, the statistical inference from this test is that the accident points are not randomly distributed but are clustered. This suggests that there may be underlying factors influencing the spatial distribution of these points.
5 First-Order Spatial Point Pattern Analysis
After data wrangling is complete, we will start to perform first-order spatial point pattern analysis using functions from spatstat package. First-order properties concern the characteristics of individual point locations and their variations of their density across space and are mostly addressed by density-based techniques, such as quadrant analysis and kernel density estimation.
Investigation of the intensity of a point pattern is one of the first and most important steps in point pattern analysis. If the point process has an intensity function λ(u), this function can be estimated non-parametrically by kernel estimation. Kernel estimation allows for smoothing of the probability density estimation of a random variable (in this analysis a point event) based on kernels as weights.
5.1 Rescaling acc_bmr_ppp
The EPSG: 32647 Coordinate References System uses meters as the standard unit. Thus, acc_bmr_ppp prepared earlier is also in metres. However, we will need to convert the measuring unit from metre to kilometeres when calculating the kernel density estimators for entirety of BMR because kilometers provide a more appropriate scale for analysing large areas.
acc_bmr_ppp.km <- rescale(acc_bmr_ppp, 1000, "km")5.2 Computing Default Kernel Density Estimation
Kernel Destiny Estimation (KDE) generates a surface (raster) representing the estimated distribution of point events over the observation window. Each cell in the KDE layer carries a value representing the estimated density of that location.
KDE allows us to identify traffic accident hot spots, which is an essential step for the appropriate allocation of resources for safety improvements. To do that, we use density.ppp() from the spatstat package.
Show the code
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
kde_default <- density(acc_bmr_ppp.km)
plot(kde_default, main = "Default Density KDE")
contour(kde_default, add=TRUE)
The key argument to pass to the density() function for point pattern objects is sigma, which determines the smoothing bandwidth of the kernel. A smaller bandwidth reveals more details, creating peaks and valleys, while a larger bandwidth smooths the distribution but with less precision. If the bandwidth is too small, the result can look overly noisy, and if it’s too large, important details may be lost due to over-smoothing.
By default, when the sigma value isn’t provided, a bandwidth is determined by a simple rule of thumb that depends only on the size of the window. This default setting might not always give the desired result. In the KDE plot we generated, there’s evidence of over-smoothing, where only one large spatial cluster is visible, potentially hiding smaller clusters or important details.
To address this, we can manually set the bandwidth using the sigma argument or choose a different kernel function through the kernel argument. This will help create more intuitive and detailed KDE maps that better capture the structure of the data.
5.3 KDE Layers with Fixed Bandwidth
5.3.1 Computing Fixed Bandwidths Using Different Bandwidth Selection Methods
4 automatic bandwidth calculation methods are available:
bw.diggle(): In the Cross Validated Bandwidth Selection, the bandwidth is chosen to minimise the mean-square error criterion. The mean-square error is a measure of the average of the squares of the errors - that is, the average squared difference between the estimated values and the actual value.bw.CvL(): In the Cronie and van Lieshout’s Criterion for Bandwidth Selection, the bandwidth is chosen to minimise the discrepancy between the area of the observation window and the sum of reciprocal estimated intensity values at the points of the point process. This method aims to choose a bandwidth that best represents the underlying point process, taking into account both the observed points and the area they occupy.bw.scott(): In the Scott’s Rule for Bandwidth Selection, the bandwidth is computed by the rule of thumb where the bandwidth is proportional to \(n^{-1/(d+4)}\), where n is the number of points and d is the number of spatial dimensions. This method is useful for estimating gradual trend.bw.ppl(): In the Likelihood Cross Validation Bandwidth Selection, the bandwidth is chosen to maximise the point process likelihood.
bw_diggle <- bw.diggle(acc_bmr_ppp.km)
bw_diggle sigma
0.04025879
bw_CvL <- bw.CvL(acc_bmr_ppp.km)
bw_CvL sigma
11.55349
bw_scott <- bw.scott(acc_bmr_ppp.km)
bw_scott sigma.x sigma.y
4.527996 3.318169
bw_ppl <- bw.ppl(acc_bmr_ppp.km)
bw_ppl sigma
0.3493275
Observations
Note that bw_diggle, bw_CvL and bw_ppl functions produce a numeric sigma value, while bw_scott provides a separate bandwidth for the x and y coordinates respectively. The sigma.x and sigma.y values represent the amount of smoothing applied in each direction when estimating the kernel density.
We can specify isotropic=TRUE argument when calculating bw_scott() method to produce a single value bandwidth, or use the bw.scott.iso() function instead.
bw_scott_iso <- bw.scott.iso(acc_bmr_ppp.km)
bw_scott_iso sigma
3.876165
par(bg = '#E4D5C9',
#mar = c(0,0,1,0),
mfrow = c(1,2))
plot(bw_diggle, xlim=c(0.0,0.06), ylim=c(-160,100))
plot(bw_CvL)
par(bg = '#E4D5C9',
#mar = c(0,0,1,0),
mfrow = c(1,2))
plot(bw_scott, main="bw_scott")
plot(bw_ppl,
xlim=c(-1,5),
ylim=c(00,30000))
5.3.2 Choosing Fixed-Bandwidth KDE
kde_diggle <- density(acc_bmr_ppp.km, bw_diggle)
kde_CvL <- density(acc_bmr_ppp.km, bw_CvL)
kde_scott <- density(acc_bmr_ppp.km, bw_scott)
kde_ppl <- density(acc_bmr_ppp.km, bw_ppl)
par(bg = '#E4D5C9',
mar = c(1,1,1,1.5),
mfrow = c(2,2))
plot(kde_diggle,main = "kde_diggle")
plot(kde_CvL,main = "kde_CvL")
plot(kde_scott,main = "kde_scott")
plot(kde_ppl,main = "kde_ppl")
Next, we will try to plot histograms to compare the distribution of KDE values obtained from density() function using different bandwidth selection methods.
par(bg = '#E4D5C9',
mar = c(2,2,2,2),
mfrow = c(2,2))
hist(kde_diggle,main = "kde_diggle")
hist(kde_CvL,main = "kde_CvL")
hist(kde_scott,main = "kde_scott")
hist(kde_ppl,main = "kde_ppl")
Observations
kde_diggle and **kde_ppl*: The sharp spike at the beginning indicates a high concentration of points within the first bin, while the remaining bins show little to no density. This pattern suggests that a specific area within our observation window experiences significant spatial clustering, with other areas showing much less activity.
kde_CvL: This method results in a more evenly spread distribution, indicating that spatial point concentrations are more dispersed across the area. However, the smaller bin size in this method may smooth out the data too much, potentially masking smaller, more localised clusters or important spatial patterns.
kde_scott: Compared to the other methods, kde_scott shows a wider range of values and a less pronounced initial spike, suggesting it captures both highly concentrated and moderately concentrated areas more effectively. This method balances the spatial distribution better by not overly smoothing or skewing the data.
Based on these observations, we will proceed with the scott method for further analysis. The scott method strikes a good balance between bias and variance. If the bandwidth is too small, the estimate can become overly variable and noisy (high variance), as seen in the histograms for bw_diggle and bw_ppl. Conversely, if the bandwidth is too large, the data becomes oversmoothed, losing important spatial details (high bias), which is evident in the kde_CvL histogram. By balancing these two extremes, bw_scott provides a more comprehensive and detailed representation of spatial clustering without losing important nuances.
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
kde_fixed_scott <- density(acc_bmr_ppp.km, bw_scott)
plot(kde_fixed_scott,
main = "Fixed-Bandwidth KDE for Accident Points (Using bw_scott)")
contour(kde_fixed_scott, add=TRUE)
Visual inspection reveals that the bandwidth suggested by the bw_scott method causes noticeable over-smoothing. Although automatic bandwidth selection offers a solid initial estimate, fine-tuning is often required to achieve more accurate results.
To counteract the over-smoothing, we will apply a simple adjustment by reducing the bandwidth by half. This reduction should help capture more detailed patterns in the data and avoid the loss of important spatial information.
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
kde_fixed_scott <- density(acc_bmr_ppp.km, bw_scott/2)
plot(kde_fixed_scott,
main = "Fixed-Bandwidth KDE for Accident Points (Using bw_scott)")
contour(kde_fixed_scott, add=TRUE)
From the plot above, it seems that reducing the bandwidth (which shrinks the point cluster buffers) has lessened the over-smoothing effect while still clearly illuminating the accident hotspot areas.
5.3.3 Kernel Methods
There are 4 types of kernels in density.ppp(), namely: Gaussian, Epanechnikov, Quartic and Dics. A Gaussian kernel is set automatically.
The code chunk below will be used to compute three more kernel density estimations by using these three kernel function.
kde_fixed_scott.gaussian <- density(acc_bmr_ppp.km,
sigma=bw_scott,
edge=TRUE,
kernel="gaussian")
kde_fixed_scott.epanechnikov <- density(acc_bmr_ppp.km,
sigma=bw_scott,
edge=TRUE,
kernel="epanechnikov")
kde_fixed_scott.quartic <- density(acc_bmr_ppp.km,
sigma=bw_scott,
edge=TRUE,
kernel="quartic")
kde_fixed_scott.disc <- density(acc_bmr_ppp.km,
sigma=bw_scott,
edge=TRUE,
kernel="disc")
par(bg = '#E4D5C9',
mar = c(0,0,1,0),
mfrow = c(2,2))
plot(kde_fixed_scott.gaussian, main="Gaussian")
plot(kde_fixed_scott.epanechnikov, main="Epanechnikov")
plot(kde_fixed_scott.quartic, main="Quartic")
plot(kde_fixed_scott.disc, main="Disc")
Observations
Although there are minor differences in smoothness and spread, all four plots display similar density estimation patterns. This indicates that the choice of kernel function does not significantly affect the KDE results. As a result, we will not emphasise this factor in our analysis.
5.4 KDE Layers with Spatially Adaptive Bandwidth
The bandwidth of a kernel estimator can be either fixed across the entire mapping area or adaptive to suit in local situations. The fixed bandwidth method explored in our earlier analysis is said to be sensitive to highly skewed distribution of spatial point patterns over geographical units for example urban versus rural. One way to overcome this problem is by using adaptive bandwidth instead.
In the section below, we compare the fixed with adaptive bandwidth-based KDE, and how they were able to detect accident hot spots.
adaptive.density() of Spatstat offers 3 estimation methods:
- method = “voronoi”: which estimates the intensity using the Voronoi-Dirichlet tessellation
- method = “kernel”: which estimates the intensity using a variable-bandwidth kernel estimator
- `method = “nearest”: which computes an estimate of the intensity function of a point pattern dataset using the distance from each spatial location to the kth nearest points
kde_adaptive_vd <- adaptive.density(acc_bmr_ppp.km,
method = "voronoi")Show the code
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
plot(kde_adaptive_vd,
main = "Voronoi-Dirichlet Adaptive Density Estimate")
kde_adaptive_kernel <- adaptive.density(acc_bmr_ppp.km,
method = "kernel")Show the code
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
plot(kde_adaptive_kernel,
main = "Adaptive Kernel Density Estimate")
kde_adaptive_knn <- adaptive.density(acc_bmr_ppp.km,
method = "nearest",
k = 100)Show the code
par(bg = '#E4D5C9',
mar = c(0,0,1,0))
plot(kde_adaptive_knn,
main = "Nearest-Neighbour Adaptive Density Estimate")
5.4.1 Choosing Adaptive KDE Method
Just as we did with the fixed bandwidth, we can create histograms to compare the distribution of KDE values obtained from the density() function using different adaptive bandwidth selection methods.
par(bg = '#E4D5C9',
mar = c(2,2,2,2),
mfrow = c(2,2))
hist(kde_adaptive_vd, main = "Voronoi-Dirichlet Adaptive")
hist(kde_adaptive_kernel, main = "Adaptive Kernel")
hist(kde_adaptive_knn, main = "Nearest-Neighbour Adaptive")
Observations
Analysis of the outputs shows no significant differences in the KDE value distributions across the various methods. Each method reveals a high concentration of points in the same area. Given this consistency, we will opt for the Adaptive Kernel method because it provides a greater number of bins. This increased number of bins allows for a more granular and detailed view of the density distribution, which can enhance our analysis by offering finer insights into spatial patterns.
5.5 Plotting Interactive KDE Maps
raster_kde_fixed_scott <- raster(kde_fixed_scott)
raster_kde_adaptive_nn <- raster(kde_adaptive_knn)
raster_kde_adaptive_kernel <- raster(kde_adaptive_kernel)
projection(raster_kde_fixed_scott) <- CRS("+init=EPSG:32647 +units=km")
projection(raster_kde_adaptive_nn) <- CRS("+init=EPSG:32647 +units=km")
projection(raster_kde_adaptive_kernel) <- CRS("+init=EPSG:32647 +units=km")Show the code
tmap_mode('view')
kde_fixed_scott <- tm_basemap(server = "OpenStreetMap") +
tm_shape(raster_kde_fixed_scott) +
tm_raster("layer",
n = 10,
title = "KDE_Fixed_scott",
alpha = 0.6,
palette = c("#f9edd1","#feb7c9","#f775a9", "#bb8bdc","#021c9e")) +
tm_shape(adminboundary_bmr)+
tm_polygons(alpha=0.1,id="ADM1_EN")+
tmap_options(check.and.fix = TRUE)
kde_adaptive_nn <- tm_basemap(server = "OpenStreetMap") +
tm_shape(raster_kde_adaptive_nn) +
tm_raster("layer",
n = 7,
title = "KDE_Adaptive_nn",
style = "pretty",
alpha = 0.6,
palette = c("#f9edd1","#feb7c9","#f775a9", "#bb8bdc","#021c9e")) +
tm_shape(adminboundary_bmr)+
tm_polygons(alpha=0.1,id="ADM1_EN")+
tmap_options(check.and.fix = TRUE)
kde_adaptive_kernel <- tm_basemap(server = "OpenStreetMap") +
tm_shape(raster_kde_adaptive_kernel) +
tm_raster("layer",
n = 7,
title = "KDE_Adaptive_Kernel",
style = "pretty",
alpha = 0.6,
palette = c("#f9edd1","#feb7c9","#f775a9", "#bb8bdc","#021c9e")) +
tm_shape(adminboundary_bmr)+
tm_polygons(alpha=0.1,id="ADM1_EN")+
tmap_options(check.and.fix = TRUE)
tmap_arrange(kde_fixed_scott,
kde_adaptive_nn,
kde_adaptive_kernel,
ncol=1,
nrow=3,
sync = TRUE)Show the code
tmap_mode('plot')Observations
The analysis reveals that the highest concentration of accidents is found in Bangkok, particularly in the vicinity of the Bangkok−Ban Chang Motorway and the Bangkok Outer Ring Road. These roads are part of Motorway Route 7 and Route 9, respectively, where accident concentrations can reach up to 30 incidents. Additionally, Borommaratchachonnani Road, which is part of Highway 338 and runs along the western edge of Bangkok near the Chao Phraya River, also exhibits high accident rates. Other areas with notable accident density include major highways within the BMR.
Different maps suggests different stories. The Adaptive Nearest Neighbour and Adaptive Kernel KDE methods both identify a significant concentration of accidents in the Khlong Chang Tai area along Highway 3701 in Samut Prakan, with a reported high concentration of 600 incidents. This area stands out due to its substantial accident rates, suggesting that it might be a focal point for further investigation and intervention. This was not identified in the Fixed Bandwidth KDE.
Overall, the findings indicate that accident hotspots are predominantly located around major roadways and highways, with specific regions exhibiting particularly high accident densities. This pattern highlights the need for targeted road safety measures and further analysis in these critical areas.
5.6 Province-Level KDE
For a more detailed look, we will create province-area level KDE maps for 2 provinces identified. In order to create such maps, we will carry out additional data wrangling as required.
Firstly, we will filter out different planning areas as separate sf objects from adminboundary_bmr.
bkk = adminboundary_bmr %>% filter(ADM1_EN == "Bangkok")
spk = adminboundary_bmr %>% filter(ADM1_EN == "Samut Prakan")
par(bg = '#E4D5C9',
mar = c(1,1,1,0),
mfrow=c(1,2))
plot(st_geometry(bkk), main = "Bangkok")
plot(st_geometry(spk), main = "Samut Prakan")
Next, we will create owin objects to represent the observation windows for respective planning area. Once owin objects are created, we will also filter accident locations in each observation window from the original acc_bmr_ppp ppp object.
bkk_owin = as.owin(bkk)
spk_owin = as.owin(spk)
acc_bkk_ppp = acc_bmr_ppp[bkk_owin]
acc_spk_ppp = acc_bmr_ppp[spk_owin]Now that we have prepared both owin and ppp objects for each planning area, we are ready to plot KDE maps. Similar to what we have done in previous section, we will try both fixed-bandwidth and adaptive bandwidth KDE maps.
5.6.1 Province-level Fixed Bandwidth KDE Maps
bkk_kde_scott <- density(acc_bkk_ppp, sigma=bw.scott, main="Bangkok")
spk_kde_scott <- density(acc_spk_ppp, sigma=bw.scott, main="Samut Prakan")
par(bg = '#E4D5C9',
mar = c(1,1,1,1.5),
mfrow = c(1,2))
plot(bkk_kde_scott,
main = "Fixed KDE - Bangkok")
contour(bkk_kde_scott,
add=TRUE)
plot(spk_kde_scott,
main = "Fixed KDE - Samut Prakan")
contour(spk_kde_scott,
add=TRUE)
5.6.2 Province-level Adaptive-Bandwidth KDE Maps
bkk_kde_adaptive_kernel <- adaptive.density(acc_bkk_ppp, method = "kernel")
spk_kde_adaptive_kernel <- adaptive.density(acc_spk_ppp, method = "kernel")
par(bg = '#E4D5C9', mar = c(1,1,1,1.5),mfrow = c(1,2))
plot(bkk_kde_adaptive_kernel,
main = "Adaptive KDE - Bangkok")
contour(bkk_kde_adaptive_kernel,
add=TRUE)
plot(spk_kde_adaptive_kernel,
main = "Adaptive KDE - Samut Prakan")
contour(spk_kde_adaptive_kernel,
add=TRUE)
Observations
The planning area-level KDE maps highlight some limitations when applied to smaller regions like provinces, despite their strength in visualising spatial data.
One issue is that fixed KDE maps tend to over-smooth the data, making it hard to identify key details, while adaptive KDE maps sometimes under-smooth, leading to too much noise. This makes it challenging to extract meaningful insights from the data.
Additionally, because KDE calculates values using grid pixels and Euclidean distance, the resulting grid blocks limit our ability to see detailed differences within smaller areas such as the example below. This restricts our ability to identify finer patterns within each region, reducing the effectiveness of the analysis at a more localised level.
6 Network Constrained Kernel Density Estimation (NKDE)
7 Temporal Network Kernel Density Estination (TNKDE)
8 Conclusion
9 References
The analysis approach will be threefold. Firstly, it compares temperature differentials between urban and rural landscapes. Subsequently, it investigates variations in temperature between wet and dry seasons. Lastly, it delves into the comparison of temperatures across different years.
flowchart TD
A[MSS Temperature Records] --> A1[Urban vs Rural]
A1 -.-> A11[Urban: Changi]
A1 -.-> A12[Rural: Tengah]
A --> A2[Seasons]
A2 -.-> A21[Dry: June]
A2 -.-> A22[Wet: December]
A --> A3[Across Years]
A3 -.-> A31[1983 to 2023]